📚 arXiv Paper Viewer

Includes LLM summaries and collapsible image display

5 papers
8 images
📋 Paper Index
1. Video Generation Models in Robotics: Applications, Research Challenges, Future Directions
5 images
📄 Abstract
Video generation models have emerged as high-fidelity models of the physical world, capable of synthesizing high-quality videos capturing fine-grained interactions between agents and their environments conditioned on multi-modal user inputs. Their impressive capabilities address many of the long-standing challenges faced by physics-based simulators, driving broad adoption in many problem domains, e.g., robotics. For example, video models enable photorealistic, physically consistent deformable-body simulation without making prohibitive simplifying assumptions, which is a major bottleneck in physics-based simulation. Moreover, video models can serve as foundation world models that capture the dynamics of the world in a fine-grained and expressive way. They thus overcome the limited expressiveness of language-only abstractions in describing intricate physical interactions. In this survey, we provide a review of video models and their applications as embodied world models in robotics, encompassing cost-effective data generation and action prediction in imitation learning, dynamics and rewards modeling in reinforcement learning, visual planning, and policy evaluation. Further, we highlight important challenges hindering the trustworthy integration of video models in robotics, which include poor instruction following, hallucinations such as violations of physics, and unsafe content generation, in addition to fundamental limitations such as significant data curation, training, and inference costs. We present potential future directions to address these open research challenges to motivate research and ultimately facilitate broader applications, especially in safety-critical settings.
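The abstract describes video models serving as embodied world models: an action-conditioned model rolled out autoregressively to predict how an environment evolves under a plan. A minimal sketch of that interface follows; all names here (`VideoWorldModel`, `rollout`) are illustrative and not from the paper, and the toy scalar "dynamics" stand in for a learned diffusion/DiT network over latent video.

```python
class VideoWorldModel:
    """Toy stand-in for an action-conditioned video world model."""

    def predict_next(self, frame: float, action: float) -> float:
        # A learned model would map (frame, action) to the next frame;
        # here a "frame" is a single intensity value, clamped to [0, 1].
        return min(max(frame + action, 0.0), 1.0)

def rollout(model: VideoWorldModel, frame: float, actions: list[float]) -> list[float]:
    """Autoregressively predict future frames given a candidate action plan,
    e.g. to score plans for visual planning or policy evaluation."""
    frames = [frame]
    for a in actions:
        frames.append(model.predict_next(frames[-1], a))
    return frames

print(rollout(VideoWorldModel(), 0.0, [0.25, 0.25, -0.25]))
# → [0.0, 0.25, 0.5, 0.25]
```

The survey's applications (data generation, reward modeling, policy evaluation) all reduce to variations on this rollout loop, conditioned on richer inputs than a scalar.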
🖼️ Images and Captions (5)
S0.F1
Overview. As embodied world models, video models generate high-fidelity predictions of the spatiotemporal evolution of real-world environments, capturing fine-grained robot-environment interactions that have traditionally been challenging for classical physics-based simulators. Their remarkable capabilities enable generalist robot policy learning, policy evaluation, and visual planning that is well aligned with commonsense knowledge.
S2.F3
Diffusion Video Model Architectures. Diffusion/flow matching has emerged as the dominant paradigm for training photorealistic, controllable video models that can be steered using text, image, and other conditioning inputs. These models broadly utilize diffusion transformers (DiTs) or U-Nets to learn important interdependencies across space and time within a compact latent space.
S3.F4
Video Models for Embodied World Modeling. Video models provide high-quality representations of the physical world, which can be implicit (e.g., latent and video representations) or explicit (e.g., point clouds and Gaussian Splatting models).
S3.F5
Video Models for Data Generation. Video models enable high-fidelity data generation for cost-effective policy learning. Robot actions can be extracted from videos through modular approaches using end-effector pose tracking, or through end-to-end approaches such as inverse-dynamics methods.
S3.F6
Dynamics and Rewards Modeling. Video models provide high-accuracy dynamics modeling and rich reward signals, which are essential in reinforcement learning, circumventing long-standing challenges in system identification and reward engineering.
2. Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
1 image
📄 Abstract
LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose ToolQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, ToolQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train ToolQP using synthetic query trajectories, followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that ToolQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
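The retrieval-as-iterative-query-planning idea in the abstract can be sketched as a loop that issues one focused query per sub-task and aggregates the hits. Everything below is a toy stand-in: ToolQP's planner is an LLM trained with RLVR and its retriever is a dense embedding model, whereas here the decomposition is hand-coded and the retriever scores by word overlap.

```python
# Hypothetical tool library: name -> documentation string.
TOOLS = {
    "flight_search": "search and book airline flights between cities",
    "hotel_search": "find hotels with price and rating filters",
    "currency_convert": "convert an amount between two currencies",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy dense retriever: rank tools by word overlap with the query."""
    def score(name: str) -> int:
        return len(set(query.lower().split()) & set(TOOLS[name].split()))
    return sorted(TOOLS, key=lambda name: -score(name))[:k]

def plan_subtasks(instruction: str) -> list[str]:
    """Toy planner: an LLM would decompose the instruction dynamically;
    here the decomposition is hand-coded for one travel-planning request."""
    return ["search flights between cities",
            "find hotels with rating filters",
            "convert amount between currencies"]

def iterative_retrieval(instruction: str) -> list[str]:
    # One focused query per sub-task, instead of embedding the whole
    # instruction in a single shot; results are aggregated without duplicates.
    selected = []
    for sub in plan_subtasks(instruction):
        for tool in retrieve(sub):
            if tool not in selected:
                selected.append(tool)
    return selected

print(iterative_retrieval("plan a trip to Tokyo on a budget"))
# → ['flight_search', 'hotel_search', 'currency_convert']
```

A single-shot query for the full instruction would match none of these tools well; the per-sub-task queries are what bridge the gap between the abstract goal and the technical tool docs.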
🖼️ Images and Captions (1)
S2.F1
Overview of the ToolQP framework. The Planner decomposes a complex user query (e.g., travel planning) into sequential sub-tasks. For each sub-task, it interactively generates queries, processes feedback from the dense retriever, and self-corrects if necessary, before aggregating the final set of relevant tools.
3. Structural Approach to Guiding a Present-Biased Agent
2 images
📄 Abstract
Time-inconsistent behavior, such as procrastination or abandonment of long-term goals, arises when agents evaluate immediate outcomes disproportionately higher than future ones. This leads to globally suboptimal behavior, where plans are frequently revised or abandoned entirely. In the influential model of Kleinberg and Oren (2014), such behavior is modeled by a present-biased agent navigating a task graph toward a goal, making locally optimal decisions at each step based on discounted future costs. As a result, the agent may repeatedly deviate from initially intended plans.
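The Kleinberg–Oren dynamics described in the abstract can be simulated directly: at each node the agent takes the edge minimizing its immediate cost plus the bias-discounted cheapest remaining cost, with discount beta < 1. The sketch below is a straightforward rendering of that rule (the graph and function names are mine, not the paper's):

```python
import heapq

def dist_to_goal(graph, goal):
    """Cheapest cost from every node to the goal (Dijkstra on reversed edges)."""
    rev = {}
    for u, nbrs in graph.items():
        for v, c in nbrs.items():
            rev.setdefault(v, {})[u] = c
    dist = {goal: 0.0}
    pq = [(0.0, goal)]
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist.get(v, float("inf")):
            continue
        for u, c in rev.get(v, {}).items():
            if d + c < dist.get(u, float("inf")):
                dist[u] = d + c
                heapq.heappush(pq, (d + c, u))
    return dist

def present_biased_walk(graph, start, goal, beta):
    """At each node, take the edge minimizing c(u, v) + beta * d(v, goal),
    where d is the true cheapest remaining cost and beta < 1 is the bias."""
    d = dist_to_goal(graph, goal)
    path, u = [start], start
    while u != goal:
        u = min(graph[u], key=lambda v: graph[u][v] + beta * d[v])
        path.append(u)
    return path

# Two routes s->t: a direct edge of cost 6, or a cheap first hop (1)
# followed by an expensive one (8).
G = {"s": {"a": 1, "t": 6}, "a": {"t": 8}, "t": {}}
print(present_biased_walk(G, "s", "t", beta=1.0))  # ['s', 't'] — pays 6
print(present_biased_walk(G, "s", "t", beta=0.3))  # ['s', 'a', 't'] — pays 9
```

With beta = 0.3 the agent discounts the looming cost of 8 down to 2.4, defers it, and ends up paying 9 instead of 6 — exactly the plan-abandoning behavior the abstract describes.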
🖼️ Images and Captions (2)
S1.F1
Example: T = {(b,e)}, A = {(a,b)}. For the initial graph G, the agent follows the path s→b→f→t. After deleting (b,f), the path is s→a→d→t; and after adding (a,b), the T-path is s→a→b→e→t.
S4.F2
The construction of graph G from Theorem 3.
4. On Angels and Demons: Strategic (De)Construction of Dynamic Models
0 images
📄 Abstract
In recent years, there has been growing interest in logics that formalise strategic reasoning about agents capable of modifying the structure of a given model. This line of research has been motivated by applications where a modelled system evolves over time, such as communication networks, security protocols, and multi-agent planning. In this paper, we introduce three logics for reasoning about strategies that modify the topology of weighted graphs. In Strategic Deconstruction Logic, a destructive agent (the demon) removes edges up to a certain cost. In Strategic Construction Logic, a constructive agent (the angel) adds edges within a cost bound. Finally, Strategic Update Logic combines both agents, who may cooperate or compete. We study the expressive power of these logics and the complexity of their model checking problems.
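To make the demon's move concrete: a typical property one might check is whether the destructive agent can sever reachability within its cost budget. The brute-force check below illustrates that question on a toy weighted graph; it is not the paper's model-checking algorithm, and all names are mine.

```python
from itertools import combinations

def reachable(edges, s, t):
    """DFS reachability over an undirected edge set."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return s == t

def demon_can_disconnect(weighted_edges, s, t, budget):
    """Brute force: can the demon delete edges of total weight <= budget
    so that t is no longer reachable from s? (Exponential; toy sizes only.)"""
    edges = list(weighted_edges)
    for r in range(len(edges) + 1):
        for cut in combinations(edges, r):
            if sum(w for *_, w in cut) <= budget:
                kept = [(u, v) for u, v, w in edges if (u, v, w) not in cut]
                if not reachable(kept, s, t):
                    return True
    return False

# Triangle: direct edge s-t (cost 3) plus a detour s-a-t (costs 2 and 1).
E = [("s", "a", 2), ("a", "t", 1), ("s", "t", 3)]
print(demon_can_disconnect(E, "s", "t", budget=4))  # True: cut (a,t) and (s,t)
print(demon_can_disconnect(E, "s", "t", budget=3))  # False: min cut costs 4
```

The angel's move is the dual operation (adding edges within a budget), and Strategic Update Logic interleaves the two.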
5. Beyond Entangled Planning: Task-Decoupled Planning for Long-Horizon Agents
0 images
📄 Abstract
Recent advances in large language models (LLMs) have enabled agents to autonomously execute complex, long-horizon tasks, yet planning remains a primary bottleneck for reliable task execution. Existing methods typically fall into two paradigms: step-wise planning, which is reactive but often short-sighted; and one-shot planning, which generates a complete plan upfront yet is brittle to execution errors. Crucially, both paradigms suffer from entangled contexts, where the agent must reason over a monolithic history spanning multiple sub-tasks. This entanglement increases cognitive load and lets local errors propagate across otherwise independent decisions, making recovery computationally expensive. To address this, we propose Task-Decoupled Planning (TDP), a training-free framework that replaces entangled reasoning with task decoupling. TDP decomposes tasks into a directed acyclic graph (DAG) of sub-goals via a Supervisor. Using a Planner and Executor with scoped contexts, TDP confines reasoning and replanning to the active sub-task. This isolation prevents error propagation and corrects deviations locally without disrupting the workflow. Results on TravelPlanner, ScienceWorld, and HotpotQA show that TDP outperforms strong baselines while reducing token consumption by up to 82%, demonstrating that sub-task decoupling improves both robustness and efficiency for long-horizon agents.
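The scoped-context idea in the abstract — each sub-task sees only the outputs of its direct DAG dependencies, and failures are retried locally — can be sketched with a topological executor. The sub-tasks and "executors" below are toy callables standing in for TDP's LLM-driven Supervisor/Planner/Executor roles; all names are illustrative.

```python
from graphlib import TopologicalSorter

def run_dag(dag, executors, max_retries=1):
    """Run sub-tasks in dependency order. Each executor receives only the
    results of its direct dependencies (a scoped context), so a local
    failure is retried in isolation without touching the rest of the plan."""
    results = {}
    for task in TopologicalSorter(dag).static_order():
        scoped = {dep: results[dep] for dep in dag.get(task, ())}
        for attempt in range(max_retries + 1):
            try:
                results[task] = executors[task](scoped)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # local retries exhausted; surface the error
    return results

# Toy travel plan: book flight and hotel independently, then budget both.
# graphlib maps each node to its set of predecessors.
dag = {"flight": set(), "hotel": set(), "budget": {"flight", "hotel"}}
executors = {
    "flight": lambda ctx: 420,                      # fare
    "hotel": lambda ctx: 3 * 90,                    # 3 nights at 90
    "budget": lambda ctx: ctx["flight"] + ctx["hotel"],
}
print(run_dag(dag, executors)["budget"])  # → 690
```

Because "budget" only ever sees the flight and hotel totals, a retry or replan of "hotel" cannot corrupt the flight sub-task's context — the isolation property the abstract credits for both robustness and token savings.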